Group 1, Mini Project. Ben Goodwin, Andre Mauldin

Resubmission Summary

Create Models

Based on the feedback from our first submission, this entire section was redone from scratch. We created both a LR model and an SVM model to classify our data based on the ('ON_TIME_ARRIVAL') variable that we created below, using our 80% accuracy classification metric as our measure of classifier success.

LR model changes: The data went through the same basic manipulation; variables that were of no use or that could be derived from other variables were removed. This time we retained categorical variables and one-hot encoded them, and all variables not already on a 0-1 scale were rescaled. We also had a 2/3-to-1/3 outcome class imbalance, which we corrected by resampling with SMOTE. We then classified on the scaled, one-hot encoded, and resampled data using a random seed on the CV object. We achieved 89% accuracy on the first attempt, then adjusted the cost parameter and reached around 93% accuracy. Unlike our previous submission, our CV worked and gave reliable, usable output.
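The preprocessing and CV code is not reproduced in this summary. A minimal sketch of the approach, using synthetic data in place of the DOT flight data and naive oversampling as a stand-in for SMOTE (which comes from the imbalanced-learn package and interpolates new points rather than duplicating existing ones), might look like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import resample

# Synthetic stand-in for the flight data with a ~2/3-to-1/3 class imbalance.
X, y = make_classification(n_samples=3000, n_features=10,
                           weights=[0.67, 0.33], random_state=42)

# Rescale every feature to the 0-1 range.
X = MinMaxScaler().fit_transform(X)

# Naive oversampling of the minority class (SMOTE would interpolate
# synthetic points instead of resampling existing ones).
n_major, n_minor = (y == 0).sum(), (y == 1).sum()
extra = resample(X[y == 1], n_samples=n_major - n_minor, random_state=42)
X_bal = np.vstack([X, extra])
y_bal = np.concatenate([y, np.ones(n_major - n_minor, dtype=int)])

# 5-fold CV with a fixed random seed, sweeping the cost parameter C.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for C in (1.0, 2.5, 4.95):
    scores = cross_val_score(LogisticRegression(C=C, max_iter=1000),
                             X_bal, y_bal, cv=cv, scoring="accuracy")
    print(f"C={C}: mean CV accuracy {scores.mean():.3f}")
```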

SVM model changes: We used the same data as manipulated for the LR portion of this assignment; however, we did not resample with SMOTE. We still retained categorical variables and one-hot encoded them. We used 5-fold CV on this data to classify with the SVM. We created a widget to adjust the SVM parameters, including cost, gamma, and the kernel. We tested various iterations with linear, RBF, and poly kernels, varying the cost and gamma values to determine the best fit for our data.
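The interactive widget is not reproducible here; a plain loop over the same parameters gives a rough sketch of the kernel/cost/gamma sweep (synthetic data stands in for the flight data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed flight data.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Sweep the parameters the widget exposed: kernel and cost
# (gamma fixed to "auto" here for brevity).
results = {}
for kernel in ("linear", "rbf", "poly"):
    for C in (1.0, 2.5):
        clf = SVC(kernel=kernel, C=C, gamma="auto")
        acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
        results[(kernel, C)] = acc
        print(f"{kernel:6s} C={C}: mean CV accuracy {acc:.3f}")
```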

Model Advantages

This section was completely reworked. We included a table with results from each modeling run and a discussion to answer the questions: "Does one type of model offer superior performance over another in terms of prediction accuracy? In terms of training time or efficiency?" We explained our results in detail, along with our thoughts and analysis of both the LR and SVM models. We also trimmed our lengthy introduction.

Interpret Feature Importance

Since the model was completely overhauled, and our original interpretations of the features were incorrect, this section was completely redone. We interpreted each and every feature and its relationship with the response.

Interpret Support Vectors

Since the model was completely overhauled, and we did not review all of the features we selected, this section was completely redone. We reviewed seven features instead of our original three, and captured each of the main categories of feature.

Create Models

First, we import the necessary packages and the data set. We are continuing our use of the DOT airline statistics and investigating techniques to better help travelers determine how to avoid delays when flying. We have broken the project into the required subsections:

- Create Models
  - Cleaning up the data
  - Code Reuse
  - Train/Test Split
  - Logistic Regression
  - Support Vector Machine
- Model Advantages
- Interpret Feature Importance
- Interpret Support Vectors

Cleaning up the Data

Next, we create an array of each type of delay. This will be helpful as we will need to handle the missing values for the delays.

We are dropping rows that are NA for arrival time, expected arrival time, actual elapsed time, and air time. We are only interested in flights that arrived at their destination.

Arrival time is needed to calculate OTA in order to train the model. ACTUAL_ELAPSED_TIME and AIR_TIME are related to how long it took the flight to arrive at its destination. We drop the NAs because only flights that actually arrive will have a value.

For our delay variables, we set the NAs to 0.0, meaning there was no delay. The previous value was NaN, which logistic regression cannot handle.
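As a sketch of this cleaning step, with a toy frame standing in for the DOT data (the column names match the real dataset, the values are made up):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking a few DOT columns; the middle row never arrived.
df = pd.DataFrame({
    "ARR_TIME": [1200.0, np.nan, 1330.0],
    "ACTUAL_ELAPSED_TIME": [110.0, np.nan, 95.0],
    "AIR_TIME": [90.0, np.nan, 80.0],
    "CARRIER_DELAY": [15.0, np.nan, np.nan],
    "WEATHER_DELAY": [np.nan, np.nan, 0.0],
})

# Keep only flights that actually arrived at their destination.
df = df.dropna(subset=["ARR_TIME", "ACTUAL_ELAPSED_TIME", "AIR_TIME"])

# NaN in a delay column means "no delay of this type", so fill with 0.0.
delay_cols = ["CARRIER_DELAY", "WEATHER_DELAY"]
df[delay_cols] = df[delay_cols].fillna(0.0)
```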

Perform One-Hot encoding of categorical data
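A minimal sketch of this one-hot encoding step with pandas, on hypothetical rows (only the `ORIGIN` column is encoded here for brevity):

```python
import pandas as pd

# Hypothetical rows: one categorical column and one numeric column.
df = pd.DataFrame({"ORIGIN": ["DFW", "MCO", "SFO"],
                   "DEP_DELAY": [0.0, 12.0, 45.0]})

# get_dummies expands each category into its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["ORIGIN"])
print(list(encoded.columns))
```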

Code Reuse

We are reusing code from the data mining notebooks at https://github.com/jakemdrew/DataMiningNotebooks to perform the training and test split, logistic regression, support vector machine, and feature importance interpretation.

Training and Test Split

To split up the data into training and test sets, we used an 80% training / 20% testing split. As specified in our first project, we used 5-fold cross-validation. We judge model performance on an accuracy metric, aiming for at least 80% accuracy in order to consider an algorithm successful. As a reminder, here is the formula we use for accuracy in our confusion matrices:

$accuracy = \frac{\text{True Positive}+\text{True Negative}}{\text{True Positive}+\text{True Negative}+\text{False Positive}+\text{False Negative}}$
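The split and the accuracy computation can be sketched as follows, with synthetic data in place of the flight data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the flight data.
X, y = make_classification(n_samples=1000, random_state=0)

# 80% training / 20% testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Accuracy from the confusion matrix, matching the formula above.
tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"accuracy = {accuracy:.3f}")
```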

SVM

Model Advantages

| Model Type | Cost | Gamma | Accuracy | Time | Kernel |
|------------|------|-------|----------|------|--------|
| Logistic | 1.0 | N/A | 0.8932 | 6.82s (5 iterations) | N/A |
| Logistic | 2.5 | N/A | 0.9237 | 6.85s (5 iterations) | N/A |
| Logistic | 4.95 | N/A | 0.9222 | 6.64s (5 iterations) | N/A |
| SVM | 1.0 | 0.10 | 0.998 | 1m (5 iterations) | linear |
| SVM | 2.5 | 0.5 | 0.999 | 1.2m (5 iterations) | linear |
| SVM | 5.0 | 1.0 | 0.997 | 1.1m (5 iterations) | linear |
| SVM | 1.0 | auto | 0.85 | 2m 35s (5 iterations) | RBF |
| SVM | 2.5 | 0.5 | 0.97 | 5m 25s (5 iterations) | Poly |

The results from each modeling run are reasonable; however, there are a few key differences in performance in terms of prediction accuracy, training time, and efficiency. General conclusion: the logistic regression model can iterate through 5 folds in a matter of seconds and achieves between 89% and 92% accuracy. Logistic regression offers superior performance in terms of prediction accuracy (we say this despite the SVM reaching 99% accuracy, as the SVM with a linear kernel appears to overfit the data). On such a large dataset there is a fair trade-off: LR's prediction accuracy is sometimes lower, in exchange for much faster execution time and somewhat easier interpretation.

The logistic regression required a bit more pre-processing of the data. We had to scale the data, one-hot encode, and resample using SMOTE to get reliable results from the LR model. However, once we put the effort into pre-processing, the modeling aspect of the LR model was straightforward. The assumptions were met, the results seemed reasonable, and they fit our metric of success of classification accuracy above 0.8. We can see the classifier performs well because it properly buckets on-time and not-on-time flights and doesn't simply classify all flights as on time (which could still yield roughly 2/3 accuracy due to the class split); resampling and proper sample weights fixed this potential issue. The execution time of the logistic regression model is also orders of magnitude shorter than the competition, making this the preferred model on this merit alone. With model tuning (the cost parameter) we can also push the classification accuracy into the low 90s.

The SVM model was more difficult, mainly because of the extremely long computation times required to test each iteration of the algorithm. However, this model can be more finely tuned (cost, gamma, and kernel). The linear kernel ran the quickest but produced somewhat unbelievable results (prediction accuracy as high as 1). After some work, the poly kernel also produced unbelievable results, while the RBF kernel produced usable ones. The RBF kernel, despite a long runtime, created believable results around 85% accuracy. These are lower than what the LR model produced, with a much longer compute time. Still, it was good to see another model validate the data.
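The training-time gap discussed above can be illustrated with a small sketch timing both model types on synthetic data (actual times will differ from those in the table, which came from the full flight dataset):

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic stand-in for the flight data.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# Time a single fit of each model type.
train_times = {}
for name, clf in [("logistic", LogisticRegression(max_iter=1000)),
                  ("svm-rbf", SVC(kernel="rbf", gamma="auto"))]:
    start = time.perf_counter()
    clf.fit(X, y)
    train_times[name] = time.perf_counter() - start
    print(f"{name}: {train_times[name]:.3f}s")
```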

Interpret Feature Importance
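The interpretations below read the sign and magnitude of the fitted logistic regression weights for each (scaled) feature. A minimal sketch of extracting such weights, with hypothetical feature names on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data; the feature names are hypothetical placeholders.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
feature_names = [f"feature_{i}" for i in range(5)]

model = LogisticRegression(max_iter=1000).fit(X, y)

# Pair each feature with its weight; larger magnitude = more influential,
# sign indicates the direction of the association with the response.
weights = dict(zip(feature_names, model.coef_[0]))
for name, w in sorted(weights.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:12s} {w:+.3f}")
```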

ACTUAL_ELAPSED_TIME

As this is a numeric variable, the interpretation is that, all else being equal, longer flights are less likely to be delayed (late arrival). This is plausible because although flights sometimes leave late, they can make up for lost time en route, and this effect is compounded on longer flights.

ARR_TIME

As this is a numeric variable and a time stamp, the interpretation does not make sense and this variable should be removed from the dataset.

DEP_DELAY

As this is a numeric variable, the interpretation is that, all else being equal, departure-delayed flights are less likely to be delayed on arrival. This is plausible because although flights sometimes leave late, they can make up for lost time en route. This is somewhat counter-intuitive, but we must remember that a departure counts as delayed as soon as one minute passes after the scheduled departure time; the incentive to make up for lost time could be great here.

WHEELS_ON

Wheels-on is the time when the wheels touch down at the arrival airport. As this is a numeric variable and a time stamp, the interpretation does not make sense and this variable should be removed from the dataset.

DEP_TIME

As this is a numeric variable and a time stamp, the interpretation does not make sense and this variable should be removed from the dataset.

WHEELS_OFF

Wheels-off is the time when the wheels leave the ground at the departure airport. As this is a numeric variable and a time stamp, the interpretation does not make sense and this variable should be removed from the dataset.

SECURITY DELAY

Security delays are the least impactful form of delay; they are often short in nature and are the delay type most associated with a flight still arriving on time.

Operating Carriers

Since there are 18 operating carriers, all with similar weights, we will interpret these as a whole and note anything of interest on particular carriers.

All things being equal, operating carrier on its own seems to have very little to do with delays. All operating carriers have positive weights, meaning they are associated with delays, with Spirit Airlines being the least impactful offender and JetBlue carrying the most weight when it comes to a delayed arrival. An interpretation of this could be that, all things held equal, if a JetBlue flight is delayed, of all the carriers it is the most likely to arrive late too.

CRS_DEP_TIME

As this is a numeric variable and a time stamp, the interpretation does not make sense and this variable should be removed from the dataset.

AIR_TIME

As this is a numeric variable, the interpretation is that, all else being equal, the more air time, the more likely a flight is to arrive on time. We have previously discussed a reason for this: more air time is more opportunity to make up for the various forms of departure delay.

ORIGIN

Since there are 10 origin airports, all with similar weights, we will interpret these as a whole and note anything of interest on particular origins.

Interpreting the one-hot encoded weights: all else being equal, flights originating out of MCO (Orlando) are about twice as likely to have an arrival delay as a flight departing from DFW. This could be for many reasons: MCO is on average further away from destinations than DFW, and events causing departure delays at MCO could have a longer span than delay events originating at DFW. The other eight origins fall somewhere in between. SFO is the worst for delays, for a variety of reasons: weather, geography, and airport congestion.

NAS_DELAY

As this is a numeric variable, the interpretation is that, all else being equal, these National Aviation System delays (delays attributed to the air system, such as non-extreme weather) are not very impactful on arrival times.

MONTH

Since there are 12 months, all with similar weights, we will interpret these as a whole and note anything of interest on particular months.

Interpreting the one-hot encoded weights: all else being equal, flights in April are about twice as likely to have an arrival delay as a flight departing in January (November is another good month for avoiding departure delays). This could be for a variety of reasons, such as weather and the seasonality of travel.

This was actually a very good finding. To check our results we searched for "best and worst months to travel," and our results matched:

https://www.refund.me/best-months-to-travel/

WEATHER_DELAY

As this is a numeric variable, the interpretation is that, all else being equal, this delay comes in the form of weather, which is out of everyone's control and is beginning to have an impact on arrival times.

LATE_AIRCRAFT_DELAY

As this is a numeric variable, the interpretation is that, all else being equal, this is a delay caused by an inbound aircraft arriving late for some reason (potentially one of the other delay forms), which causes ripple effects through the system.

CARRIER_DELAY

As this is a numeric variable, the interpretation is that, all else being equal, this is an airline-induced delay. This could be maintenance, crew scheduling, late connecting passengers, and other things on the operations side of an airline. This is the type of delay most commonly associated with heavy arrival delays.

CRS_ARR_TIME

As this is a numeric variable and a time stamp, the interpretation does not make sense and this variable should be removed from the dataset.

CRS_ELAPSED_TIME

As this is a numeric variable and a measure of a flight's scheduled length, the interpretation does not make sense and this variable should be removed from the dataset.


Interpret Support Vectors

15296, 55

The support vector matrix has shape (15296, 55): 15,296 support vectors, each with 55 features.

7412 7884

This is the number of support vectors for each class: 'ON_TIME_ARRIVAL' has two classes, 0='YES' (7,412 support vectors) and 1='NO' (7,884).

Support vectors are the data points that lie *closest* to the decision surface and are the points that are most difficult to classify.
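The shape and per-class counts reported above come from attributes of the fitted SVM; a small sketch on synthetic data (the scikit-learn attribute names are the same as in the real model):

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic two-class data standing in for the flight data.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

# Shape is (n_support_vectors, n_features); n_support_ counts per class.
print(clf.support_vectors_.shape)
print(clf.n_support_)
```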

Feature Reviews

The features selected were ACTUAL_ELAPSED_TIME, DEP_DELAY, ARR_TIME, CRS_ELAPSED_TIME, Operating Carrier_WN, ORIGIN_SFO, and CARRIER_DELAY.

These features were selected for their uniqueness in the dataset. ACTUAL_ELAPSED_TIME is an interesting one: it is a continuous variable measuring the length of a flight, and as it turns out, elapsed time is associated with on-time flights. This is problematic, since longer elapsed times often just reflect longer distances, so the support vectors here may throw longer flights onto the on-time side for the reasons described above. The separation is not as great as in the original distribution. This variable seems to be linearly separable.

DEP_DELAY has less separation than the original distribution, and this makes sense: a flight with a departure delay is most likely going to have an arrival delay as well. That isn't always the case, but it is a strong indicator. The decision boundary here is a strong one and separates the data well. This variable seems to be (very) linearly separable.

ARR_TIME support vectors have less separation when compared to the original data; however, the distribution is roughly the same. This variable was chosen at random to help our understanding of support vectors, and to see whether it generalizes well.

CRS_ELAPSED_TIME has a bit less separation than the original data, but it still seems to be linearly separable. CRS_ELAPSED_TIME is also closely related to ACTUAL_ELAPSED_TIME and suffers from the same issue, namely assuming a flight has arrived late simply because it was a long flight.

Operating Carrier_WN was selected because Southwest is in the interesting position of often leaving late but arriving on time. This could present an interesting problem for the decision boundary. The support vectors very closely match the distribution of the original data.

ORIGIN_SFO was selected due to its weather problems, which are associated with late arrivals: if the airport has bad weather, arriving flights will sometimes be delayed. The separation of the support vectors is nearly identical to the original distribution of the data, which suggests this variable generalizes well.

CARRIER_DELAY was chosen as it is the most frequent reason for a flight to arrive late. This is interesting because this delay reason is entirely in the hands of the operating airline, unlike something like a weather delay. The support vectors nearly match the distribution of the data, except with less separation than the original. This is a tough measure, because these delays are often very short and don't always result in an arrival delay.